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ABSTRACT 

Motivation: By capturing various biochemical interactions, biological 
pathways provide insight into underlying biological processes. Given 
high-dimensional microarray or RNA-sequencing data, a critical chal- 
lenge is how to integrate them with rich information from pathway 
databases to jointly select relevant pathways and genes for phenotype 
prediction or disease prognosis. Addressing this challenge can help us 
deepen biological understanding of phenotypes and diseases from a 
systems perspective. 

Results: In this article, we propose a novel sparse Bayesian model for 
joint network and node selection. This model integrates information 
from networks (e.g. pathways) and nodes (e.g. genes) by a hybrid of 
conditional and generative components. For the conditional compo- 
nent, we propose a sparse prior based on graph Laplacian matrices, 
each of which encodes detailed correlation structures between net- 
work nodes. For the generative component, we use a spike and slab 
prior over network nodes. The integration of these two components, 
coupled with efficient variational inference, enables the selection of 
networks as well as correlated network nodes in the selected 
networks. 

Simulation results demonstrate improved predictive performance and 
selection accuracy of our method over alternative methods. Based on 
three expression datasets for cancer study and the KEGG pathway 
database, we selected relevant genes and pathways, many of which 
are supported by biological literature. In addition to pathway analysis, 
our method is expected to have a wide range of applications in 
selecting relevant groups of correlated high-dimensional biomarkers. 
Availability: The code can be downloaded at www.cs.purdue.edu/ 
homes/szhe/software. htm I . 
Contact: alanqi@purdue.edu 
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1 INTRODUCTION 

With the popularity of high-throughput biological data such as 
microarray and RNA-sequencing data, many variable selection 
methods — such as lasso (Tibshirani, 1996) and elastic net 
(Zou and Hastie, 2005) — have been proposed and applied to 
select relevant genes for disease diagnosis or prognosis. 
Nevertheless, these approaches ignore invaluable biological 
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pathway information accumulated over decades of research; 
hence, their selection results can be difficult to interpret biologic- 
ally and their predictive performance can be limited by a small 
sample size of expression profiles. To overcome these limitations, 
a promising direction is to integrate expression profiles with rich 
biological knowledge in pathway databases. Because pathways 
organize genes into biologically functional groups and model 
their interactions that capture correlation between genes, this 
information integration can improve not only the predictive 
performance but also interpretability of the selection results. 
Thus, a critical need is to integrate pathway information with 
expression profiles for joint selection of pathways and genes 
associated with a phenotype or disease. 

Despite their success in many applications, previous sparse 
learning methods are limited by several factors for the integra- 
tion of pathway information with expression profiles. For 
example, group lasso (Yuan and Lin, 2007) can be used to utilize 
memberships of genes in pathways via a 1 1/2 norm to select 
groups of genes, but they ignore pathway structural information. 
An excellent work by Li and Li (2008) overcomes this limitation 
by incorporating pathway structures in a Laplacian matrix of a 
global graph to guide the selection of relevant genes. In addition 
to graph Laplacians, binary Markov random field priors can be 
used to represent pathway information to influence gene selec- 
tion (Li and Zhang, 2010; Stingo and Vannucci, 2010; Wei and 
Li, 2007, 2008). These network-regularized approaches do not 
explicitly select pathways. However, not all pathways are rele- 
vant, and pathway selection can yield insight into underlying 
biological processes. A pioneering approach to joint pathway 
and gene selection by Stingo et al. (2011) uses binary Markov 
random field priors and couples gene and pathway selection by 
hard constraints — for example, if a gene is selected, all the path- 
ways it belongs to will be selected. However, this consistency 
constraint might be too rigid from a biological perspective: an 
active gene for cancer progression does not necessarily imply that 
all the pathways it belongs to are active. Given the Markov 
random field priors and the nonlinear constraints, posterior dis- 
tributions are inferred by a Markov Chain Monte Carlo 
(MCMC) method (Stingo et al, 2011). But the convergence of 
MCMC for high-dimensional problems is known to take a long 
time. 

To overcome these limitations, we propose a new sparse 
Bayesian approach, called Network and NOde Selection 
(NaNOS), for joint pathway and gene selection. NaNOS is a 
sparse hybrid Bayesian model that integrates conditional and 
generative components in a principled Bayesian framework 
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(Lasserre et al., 2006). For the conditional component, we use a 
graph Laplacian matrix to encode information of each network 
(e.g. a pathway) and incorporate it into a sparse prior distribu- 
tion to select individual networks. For the generative component, 
we use a spike and slab prior distribution to choose relevant 
nodes (e.g. genes) in selected networks. For this hybrid model, 
we do not impose the hard consistency constraints used by 
Stingo et al. (2011). Furthermore, the prior distribution of our 
model does not contain intractable partition functions. This en- 
ables us to give a full Bayesian treatment over model parameters 
and develop an efficient variational inference algorithm to obtain 
approximate posterior distributions for Bayesian estimation. As 
described in Section 3, our inference algorithm is designed to 
handle both continuous and discrete outcomes. 

Simulation results in Section 4 demonstrate superior perform- 
ance of our method over alternative methods for predicting 
continuous or binary responses, as well as comparable or im- 
proved performance for selecting relevant genes and pathways. 
Furthermore, on real expression data for diffuse large B cell 
lymphoma (DLBCL), pancreatic ductal adenocarcinoma 
(PDAC) and colorectal cancer (CRC), our results yield meaning- 
ful biological interpretations supported by biological literature. 

2 MODEL 

In this section, we present the hybrid Bayesian model, NaNOS, 
for network and node selection. First, let us start from the clas- 
sical variable selection problem. Suppose we have N independent 
and identically distributed samples V = {(xi, t\), . . . , (x N , t N )}, 
where x t and t t are the explanatory variables and the response 
of the z'-th sample, respectively. The explanatory variables can be 
various biomarkers, such as gene expression levels or single-nu- 
cleotide polymorphisms. Following the tradition in variable 
selection, we normalize the values of each variable so that its 
mean and standard deviation are 0 and 1, respectively. The 
response can be certain phenotype or disease status. We aim to 
predict the response vector t = [t\, . . . , t N ] T based on the 
explanatory variables X = [xi, ...,x;v] T and to select a small 
number of variables relevant for the prediction. Because the 
number of variables (e.g. genes) is often much bigger than the 
number of samples, the prediction and selection tasks are statis- 
tically challenging. 

To reduce the difficulty of variable selection, we can use 
valuable information from networks, each of which contains 
certain variables as nodes and represents their interactions. 
For example, biological pathways cluster genes into functional 
groups, revealing various gene interactions. Based on 
M networks, we organize the explanatory variables x z into 
M subvectors, each of which comprises the values of explanatory 
variables in its corresponding network. If a variable (i.e. a gene) 
appears in multiple networks (i.e. pathways), we duplicate its 
value in these networks. Note that networks here are exchange- 
able with graphs; we can use them to represent not only 
biological pathways but also linkage disequilibrium structures 
for genetic variation analysis. 

Our model is a Bayesian hybrid of conditional and generative 
models based on a general framework proposed by 
(Lasserre et al., 2006). The conditional component selects 
individual networks via 'discriminative' training, the generative 



component chooses relevant nodes in the selected networks and 
the two models are glued together through a joint prior 
distribution, so that the selected networks can guide node selec- 
tion and, in return, the selected nodes can influence network 
selection. 

Specifically, for the conditional model, we use a Gaussian data 
likelihood function for the continuous response 



(1) 



where w are regression weights, each of which represents the 
contribution of the corresponding node to the response, and 
r is the precision parameter. For the unknown variance r, we 
assign an uninformative diffuse Gamma prior, Gam(r|g, h) with 
g = h= 10 6 . 

For the binary response, we use a logistic likelihood 



p(t\X, w) = fl a(xjyy)"[\ - a(x>)] 



i-ti 



(2) 



where ti e {0, 1}, w are classifier weights and c(-) is the logistic 
function [i.e. o{y) = (1 + exp(— y)) -1 ]. Based on the M networks, 
we partition w into M groups, so that w = [wi , . . . , wm] t where 
Wfc are the weights for the explanatory variables in the k-th 
network. 

To incorporate the topological information of a network, we 
use its normalized Laplacian matrix representation. Specifically, 
given an adjacent matrix that represents the edges (i.e. inter- 
actions) between nodes in the k-th network, the normalized 
Laplacian matrix is defined as 



UO'j) = 
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where deg(z') = J2j ^k(U) is the degree of the z'-th node in the 
k-th network. 

Based on the graph Laplacian matrices, we design the follow- 
ing mixture prior over to select relevant networks: 



p(w k \a k ) = A/XwjtlO, JiL* TM/XwjfclO, s 2 l k ) 



\l-0fjfc 



(3) 



where a k is a binary variable indicating whether the k-th network 
is selected, s\ >s 2 , s 2 ~ 0 and Ik is an identity matrix. We set the 
hyperparameters s\ and s 2 based on cross-validation (CV) in our 
experiments. To make sure is strictly positive-definite, we add 
a diagonal matrix 10"% to L^. In (3), captures the 
correlation information between nodes in the k-th network. 
Note that if we replace by I& in the slab component, the 
prior (3) becomes a simple generalization of the classical spike 
and slab prior (George and McCulloch, 1997) for group 
selection. When a& = 1, the k-th network is selected and the 
elements of are encouraged to be similar to each other due 
to the Laplacian matrix L^; when ctk = 0, because s 2 is close to 
zero, the corresponding Gaussian prior prunes w^. We use a 
Bernoulli prior distribution to reflect the uncertainty in 
a^, p(dk) = (ukf k (\ — Uk) l ~ ak where Uk e [0, 1] is the selection 
probability. Without any prior preference over selecting or prun- 
ing the k-th network, we assign a uniform prior over u k \ 
p(iik) = 1 [i.e. p{uk) = Beta(w^; a, b) where a = b = 1]. 
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To identify relevant nodes, we introduce a latent vector w& in 
the generative model for each network k, which is tightly linked 
to v/k as explained later. We use a spike and slab prior: 



Pk 

7=1 

Pk 

7=1 

= /<0|w*,&) 



(4) 



where p k is the number of nodes in the k-th network, r 2 ~ 0 and 
Pkj is a binary variable indicating whether to select the y'-th node 
in the k-th network. We give a Bernoulli prior, 

p{p kj ) = (vkj)P kJ (\ — Vkj) l ~^ kJ , and a uniform prior over v kJ : 
p(vkj) = 1 (i.e. p(vkj) = Beta(v^-|c, d) where c = d=\). As 
shown above, the spike and slab prior p(\Vk\f3 k ) has the same 
form as p(0\\Vk, f3 k ), which can be viewed as a generative 
model — in other words, the observation 0 is sampled from w^. 
This view enables us to combine the sparse conditional model for 
network selection with the sparse generative model for node 
selection via a principled hybrid Bayesian model. 

Specifically, to link the conditional and generative models 
together, we introduce a prior on w^: 



p(w k \w k ) = Af(w k \w k , XI) 



(5) 



where the variance X controls how similar and are in our 
joint model. For simplicity, we set X = 0 so that 
p(wk\v?k) = <5(Wfc - Wfc) where 8(f) = 1 if f=0 and 8(f) = 0 
otherwise. The graphical model representation of the joint 
model is given in Figure 1. 

The network and node selections are consistent with each 
other in a probabilistic sense. If a network is pruned, all its 
node are removed. Because = is enforced by the prior 
<5(wfc — w&), when a& = 0, = 0 implies = 0. As a result, 
the spike component in (4) will be selected for all the nodes in 
the k-th network (i.e. = 0 for j = 1, . . . ,pk) with a higher 
probability than the slab component. On the other hand, it is 
easy to see that if one or multiple nodes in a network are selected, 
then this network will be selected too. Note that if a node 
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k = l,...,M 



: 1,...,JV 



appears in multiple networks and is selected, our model will 
not force all the networks that contain this node to be chosen. 
The reason is that we duplicate the value of this node in the 
networks and treat their corresponding regression or classifica- 
tion weights as separate model parameters. 



3 ALGORITHM 

In this section, we present the variational Bayesian algorithm for 
model estimation. Specifically, we develop the variational 
updates to efficiently approximate the posterior distribution of 
weights w, the network- selection indicators a, the node- selection 
indicators p, the network- and node-selection probabilities u and 
v and the precision parameter r for regression. Based on the 
posteriors of a and ft, we can decide which networks and 
nodes are selected. 

For regression, based on the model specification in Section 2, 
the posterior distribution of our model is 

p(w, w, a, ft, u, v, r|t, X) 
= ^./V(t|Xw, r- 1 I)Gamma(r)- 

~\p(\v k \a k )p(\v k \\v k )p(0\yv k , 0 k ) BQm(a k \u k ) Beta(^)- ( 6 ) 

k 

~[ BernO%| v*/)Beta(v#) 

j 

where p(y/k\oik) and p(0\vtk,Pk) are defined in (3) and (4), 
p(yvk\vrk) = — Yfk) and Z is the normalization constant. 
For classification, the posterior distribution is similar to (6), 
except that we replace the Gaussian likelihood (1) by the logistic 
function (2) and remove the precision parameter r and its prior 
for regression in (6). 

Classical Markov chain Monte Carlo methods can be applied 
to approximate the posterior distribution. However, given the 
high dimensionality of the parameters (e.g. w and a), it would 
take a long time for a sampler to converge. In practice, it is even 
difficult to judge the sampler's convergence. Thus, we resort to a 
computationally efficient variational approximation to (6). 

Specifically, we approximate the exact posterior 
distribution in (6) by a factorized distribution: Q(0) = 
6(w)6(a)S(0)G(u)g(v)g(r), where 0 denotes all the latent vari- 
ables. Note that, for classification, we do not have Q r (r). 
Because we set /?(w|w) = <5(w — w), we do not need a separate 
distribution Q(w). To solve Q(0), we minimize the Kullback- 
Leibler (KL) divergence between the exact and approximate 
posterior distributions of 0: 



KL(6(0)l|/<0|t,X))= / e@) In 



Q(Q) 
- P (e\t,xy 



de 



(7) 



Fig. 1. The graphical model representation of NaNOS 



Applying coordinate descent for the minimization of (7), we 
obtain efficient updates for the variational distributions as 
described in the following sections. The updates are iterative: 
we update one of the variational distributions at a time while 
having all the other variational distributions fixed, and iterate 
these updates until convergence. Because these updates 
monotonically decrease the value of the KL divergence (7), 
which is lower bounded by zero, they are guaranteed to converge 
in terms of the KL value (Bishop, 2006). 
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3.1 Regression 

The variational distributions for regression have the following 
forms: 

g( w )=Af(w|m,E) (8) 

Q(«) = \\ k Y a k "^-Yk) l - a " (9) 

eo 8 ) = n* n, (i^q - ^~ h ' o°) 

ewa^w'ii-utf'- 1 (ii) 

2w a n* n, (^'""'(i - v*/'- 1 (i2) 



Q(r) = T(r\g,h) (13) 
Their parameters are iteratively updated as follows: 

S = (A + (r^X)" 1 m = (r)SX T t (14) 



flife = Yk + 0 
4/ = rjkj + c 



ftfc = l-)* + * (15) 
d kj =\-r] kj + d (16) 



^ = 1/(1 + exp((ln(l - u k )) - {\nu k ) + ^ln^ 

2 ^2 



(17) 

-ln|L,| + -tr((w^)(-L fc --I,)) 



Ikj = + exp ((ln(l - v kj )) - (In v kj ) 



h = h + -t T t- m T X T t + - V x7(ww T ) X/ 



• = g + - 



(18) 



(19) 



(20) 



where A = ^diag({^Uy + ^diag({(l - yk)lkW+ £ diagfo) + 
^diag(l — 77) [note that diag({y^L^}^) is a block-diagonal ma- 
trix], (•) means expectation over the corresponding variational 
distribution, and the required moments in the above equations 
are 

(ww T ) = S + mm T (r) = g/h 

(Inn*) = V<4) - f(h) (ln(l - 1/*)) = Vfe) - V^fe) 
(In v kj ) = f{c kj ) ~ Wkj) (ln(l - v kj )) = if(d kj ) - Wkj) 
where f(x) = ^ln r(x), 4 = 4 + 4 and/*/ = c kj + 4/- 



3.2 Classification 

Compared with regression, the classification task is more 
challenging. Because of the logistic function (2), we cannot dir- 
ectly solve the variational distribution Q(w). Therefore, we use a 



> org) exp (- 



lower bound proposed by (Jaakkola and Jordan, 2000) to replace 
the logistic function in the joint distribution: 

oOO'O-aG/)) 1 -' 

,(2t- l)y-£ 9 9 9 \ ( 21 ) 

-^^-/gX(2^-l)V-§ 2 )) 

where fix) = ^tanh(£/2), and § is a variational parameter. Note 
that the equality is achieved when £ = (2f — 1)jf. Because the 
logarithm of the lower bound (21) is quadratic in y, it essentially 
converts the logistic function into a Gaussian form so that the 
variational inference becomes tractable. 

Combining the maximization of the lower bound (21) with the 
minimization of the KL divergence (7), we obtain the variational 
updates for classification. They are the same as those for the 
regression task, except for that Q(w) = Af(y/\m, £), now we have 



s = ( A + 2 E/&) x * x 0 m = 2 sxT(2t _ 1} 



(22) 



where A is the same as in the regression. 

In addition, maximization of the lower bound of the logistic 
function gives the update for the variational parameter § f : 



g = xj(ww T )Xi. 



(23) 



3.3 Computational cost 

The computational cost of the proposed algorithm is dominated 
by (14) for regression and (22) for classification. For both cases, 
it takes 0(p 3 ) for matrix inversion to obtain £ and 0(Np + p 2 ) 
to obtain m for each iteration. Thus, the total cost is 0(p 3 + Np) 
and, for most applications where p>N, it simplifies to 0(p 3 ). 

4 EXPERIMENTS 

In this section, we apply NaNOS to synthetic and real gene 
expression data to select pathways (i.e. networks) and genes 
(i.e. nodes), and provide biological analysis of our results. We 
also compare NaNOS with alternative methods, including lasso 
(Tibshirani, 1996), elastic net (Zou and Hastie, 2005), group 
lasso (Jacob et al, 2009; Yuan and Lin, 2007), the network-con- 
strained regularization approach [Li and Li (2008), henceforth 
'LL'] and the sparse Bayesian model with the classical spike and 
slab prior (George and McCulloch, 1997). For lasso and elastic 
net, we used the Glmnet software package (www-s tat. Stanford. 
edu/~tibs/glmnet-matlab/). For group lasso, we treat each path- 
way as a group. To handle genes appearing in multiple pathways 
(i.e. groups), we first duplicated their expression levels for each 
group — as suggested by (Jacob et al., 2009) — and then used the 
SLEP software package (www.public.asu.edu/~jye02/Software/ 
SLEP/) for group lasso estimation. For the spike and slab 
model, we implemented variational inference similar to our 
updates in Section 3. Just as NaNOS, all these software packages 
use the Gaussian likelihood for regression and the logistic likeli- 
hood for classification. We used the default configuration of 
these software packages for the maximum number of iterations, 
initial values and the threshold for convergence. To tune regu- 
larization weights in lasso, group lasso and the LL approach, we 
conducted thorough 10-fold CV on training data (i.e. not using 
the test data) using a large computer cluster. The CV grids on the 
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free parameters are summarized here: for lasso, a = [0 : 0.01 : 1]; 
for elastic net, a = [0 : 0.01 : 1] and p = [0 : 0.01 : 1]; for 
group lasso (both regression and logistic regression), 
a = [0 : 0.01 : 1]; and for the LL approach, k { = [1 : 25 : 300] 
and X2 = [1 : 25 : 300] (we also did a second-level CV after we 
pruned the range of A,i and A 2 values based on the first-level CV). 
Finally, for NaNOS, the CV grids are si = n = [0.1, 1, 3] and 
s 2 = r 2 = [10- 3 ,10- 4 ,10- 5 ,10- 6 ]. 

On the synthetic data for which we knew the true relevant 
pathways, we also compared NaNOS with a popular tool for 
gene set enrichment analysis (GSEA) (Mootha et ai, 2003; 
Subramanian et ai, 2005). We treated each pathway as a set, 
used GSEA's default configuration and applied its suggested 
criterion false discovery rate (FDR) <25% to discover enriched 
pathways. We then identified all the genes in these enriched path- 
ways as target genes. Because GSEA cannot provide predictions 
on responses t, we did not include it for comparison on the real 
data. 

4.1 Simulation studies 

We first compare all the methods on synthetic data in the 
following three experiments. 

Experiment 1. We followed the first and second data gener- 
ation models used by Li and Li (2008). Specifically, we simulated 
expression levels of 200 transcription factors (TFs), each control- 
ling 10 genes in a simple tree-structured regulatory network, and 
assumed that four pathways — including all of their genes — have 
effect on the response t. We sampled the expression levels of each 
TF from a standard normal distribution, x TF ~ J\f(0, 1) and the 
expression level of each gene that this TF regulates from 
J\f(Q.lxTF, 0.51). This implies a correlation of 0.7 between the 
TF and its target genes. 

For the first model with the continuous response, we designed 
a weight vector for each pathway, p = [1, -^=, . . . , -^=], 

corresponding to the TF and 10 genes it regulates, and then 

sampled t as follows: 



w = [5p, — 5p, 3p, 
t = Xw + e 



3p, 0 T 



where e ~ «/V(0, a]) and 0 is a vector of all zeros. 

The second model is the same as the first one, except that the 
genes regulated by the same TF can have either positive or nega- 
tive effect on the response t. Specifically, we set 



1 



1 



10 



For the first and second models, the noise variance was set to be 
o 2 e — (Tijwj)/4 so that the signal-to-noise ratio was 12.85 and 
7.54, respectively. 

For the binary response, we followed the same procedure as 
for the continuous response to generate expression profiles X and 
the parameters w. Then we sampled t from (2). 

For each of the settings, we simulated 100 samples for training 
and 100 samples for test. We repeated the simulation 50 times. 
To evaluate the predictive performance, we calculated the 



prediction mean-squared error for regression and the error 
rate for classification. To examine the accuracy of gene and 
pathway selection, we also computed sensitivity and 
specificity and summarized them in the F\ score, F\ = 2> 
(sensitivity x specificity)/(sensitivity + specificity). The bigger 
the F\ score, the higher the selection accuracy. 

All the results are summarized in Figure 2, in which the error 
bars represent the standard errors. For all the settings, NaNOS 
gives smaller errors and higher F\ scores for gene selection than 
the other methods, except that, for classification of the samples 
from the second data model, NaNOS and group lasso obtain the 
comparable F\ scores. All the improvements are significant under 
the two-sample Mest (P<0.05). We also show the accuracy of 
group lasso, GSEA and NaNOS for pathway selection in 
Figure 5. Again, NaNOS achieves significantly higher selection 
accuracy. Because the LL approach was developed for regression, 
we did not have its classification results. While the LL approach 
uses the topological information of all the pathways, they are 
merged together into a global network for regularization. In con- 
trast, using a sparse prior over individual pathways, NaNOS can 
explicitly select pathways relevant to the response, guiding the 
gene selection. This may contribute to its improved performance. 

Experiment 2. For the second experiment, we did not require 
all genes in relevant pathways to have effect on the response. 
Specifically, we simulated expression levels of 100 TFs, each 
regulating 21 genes in a simple regulatory network. We sampled 
the expression levels of the TFs, the regulated genes and their 
response in the same way as in Experiment 1 , except that we set 



0, ...,0 



for the first data generation model and 




(24) 



for the second data generation model. Note that the last 1 1 zero 
elements in p indicate that the corresponding genes have no effect 
on the response t, even in the four relevant pathways. 

The results for both the continuous and binary responses are 
summarized in Figures 3 and 5. For regression based on the first 
data model, NaNOS and LL obtain the comparable F\ scores; 
for all the other cases, NaNOS significantly outperforms the 
alternative methods in terms of both prediction and selection 
accuracy (P<0.05). 

Experiment 3. Finally, we simulated the data as in Experiment 
2, except that we replaced V2T in the denominators in (24) with 
21, to obtain a weaker regulatory effect of the TF. Again, as 
shown in Figures 4 and 5, NaNOS outperforms the competing 
methods significantly. 

4.2 Application to expression data 

Now we demonstrate the proposed method by analyzing 
gene expression datasets for the cancer studies of DLBCL 
(Rosenwald et ai, 2002), CRC (Ancona et al., 2006) and 
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Fig. 2. Prediction errors and F\ scores for gene selection in Experiment 1. ENet, S&S and GLasso stand for elastic net, the spike and slab model and 
group lasso, respectively; Data 1 and 2 indicate the first and second data generation models 
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Fig. 4. Prediction errors and F\ scores for gene selection in Experiment 3 
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PDAC (Badea et al, 2008). We used the probeset-to-gene map- 
ping provided in these studies. For the CRC and PDAC datasets 
in which multiple probes were mapped to the same genes, we 
took the average expression level of these probes. We used the 
pathway information from the KEGG pathway database (www. 
genome.jp/kegg/pathway.html) by mapping genes from the 
cancer studies into the database, particularly in the categories 
of Environmental Information Processing, Cellular Processes 
and Organismal Systems. 

4.2.1 Diffuse large B cell lymphoma We used gene expression 
profiles of 240 DLBCL patients from an uncensored study in the 
Lymphoma and Leukemia Molecular Profiling Project 
(Rosen wald et al, 2002). From 7399 probes, we found 752 
genes and 46 pathways in the KEGG dataset. The median sur- 
vival time of the patients is 2.8 years after diagnosis and chemo- 
therapy. We used the logarithm of survival times of patients as 
the response variable in our analysis. 

We randomly split the dataset into 120 training and 120 test 
samples 100 times and ran all the competing methods on each 
partition. The test performance is visualized in Figure 6a. 
NaNOS significantly outperforms lasso, elastic net and group 
lasso. Although the results of the LL approach can contain con- 
nected subnetworks, these subnetworks do not necessarily cor- 
respond to (part of) a biological pathway. For instance, they may 
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Fig. 5. F\ scores for pathway selection. 'EXP' stands for 'Experiment' 
and 'D' stands for 'Data model' 



consist of components from multiple overlapped pathways. In 
contrast, NaNOS explicitly selects relevant pathways. Four path- 
ways had the selection posterior probabilities larger than 0.95 
and they were consistently chosen in all the 100 splits. Two of 
these pathways are discussed below. 

First, NaNOS selected the antigen processing and presentation 
pathway. The part of this pathway containing selected genes is 
visualized in Figure 7a. A selected regulator CIITA was shown to 
regulate two classes of antigens MHC I and II in DLBCL (Cycon 
et al, 2009). The loss of MHC II on lymphoma cells — including 
the selected HLA-DMB, -DQB1, -DMA, -DRA, -DRB1, -DPA1, 
-DPB1 and -DQA1 — was shown to be related to poor prognosis 
and reduced survival in DLBCL patients (Rosenwald et al, 2002). 
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Fig. 6. Predictive performance on three gene expression studies of cancer 
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Fig. 7. Examples of part of identified pathways, (a) The antigen processing and presentation pathway for DLBCL; (b) the cell cycle pathway for CRC; 
(c) the TGF-/3 signaling pathway for PDAC. Shaded and unshaded boxes indicate selected and not selected genes, respectively 



The selected MHC I (e.g. HLA-A,-B,-C,-G) was reported to be 
absent from the cell surface, allowing the escape from immuno- 
surveillance of lymphoma (Amiot et aL, 1998). And the selected Ii/ 
CD74 and HLA-DRB were proposed to be monoclonal antibody 
targets for DLBCL drug design (Dupire and Coiffier, 2010). 

Second, NaNOS chose cell adhesion molecules (CAMs). 
Adhesive interactions between lymphocytes and the extracellular 
matrix (ECM) are essential for lymphocytes' migration and 
homing. For example, the selected CD99 is known to be overex- 
pressed in DLBCL and correlated with survival times (Lee et aL, 
2011), and LFA-1 (ITGB2/ITGAL) can bind to ICAM on the 
cell surface and further lead to the invasion of lymphoma cells 
into hepatocytes (Terol et aL, 1999). 

4.2.2 Colorectal cancer We applied our model to a CRC data- 
set (Ancona et aL, 2006). It contains gene expression profiles 
from 22 normal and 25 tumor tissues. We mapped 2455 genes 
from 22283 probes into 67 KEGG pathways. The goal was to 
predict whether a tissue has the CRC or not and select relevant 
pathways and genes. 

We randomly split the dataset into 23 training and 24 test sam- 
ples 50 times and ran all the methods on each partition. The test 
performance is visualized in Figure 6b. Again, based on a two- 
sample Mest, NaNOS outperforms the alternatives significantly 
(P<0.05). Three out of the four pathways with the selection pos- 
terior probabilities larger than 0.95 are discussed below. They 
were selected 20, 50 and 50 times in the 50 splits. 

First, NaNOS selected the cell cycle pathway. This selection is 
consistent with the original result by Ancona et aL (2006). As 
shown in Figure 7b, NaNOS selected mitotic spindle assembly 
related genes. Specifically, Bubl and Madl may regulate the 



checkpoint complex (MCC) containing Mad2, BubRl and 
Bub3. The upregulated MCC may in turn inhibit ability of 
APC/C to ubiquitinate securin and further lead to mitotic event 
extension in CRC (Menssen et aL, 2007). NaNOS also chose 
cyclin/CDK complexes, among which CycD/CDK4 overexpres- 
sion is found in mouse colon tumor and CDK1, CDK2, CycE are 
increased in human CRC (Vermeulen et aL, 2003; Wang et aL, 
1998). NaNOS further identified the minichromosome mainte- 
nance (MCM) complex — including MCM2 and MCM5 — which 
are biomarkers for the CRC stage identification (Giaginis et aL, 
2009). Moreover, the selected TP53 and c-Myc are known to be 
closely related to CRC (Menssen et aL, 2007). 

Second, NaNOS chose the intestinal immune network for IgA 
production. A greatly increased level of IgA — as a result of long- 
term intestinal inflammation — can increase the chance of CRC 
(Rizzo et aL, 2011) and serve as an effective biomarker for early 
diagnosis of CRC (Chalkias et aL, 2011). Also, selected chem- 
kines in this pathway, such as CXCR4 and CXCL12, may con- 
tribute to CRC progression (Sakai et aL, 2012). 

Third, NaNOS selected the cytokine-cytokine receptor inter- 
action pathway as well as several well-known CRC-related 
molecules in this pathway. For instance, CXCL13 is a biomarker 
for stage II CRC prognosis (Agesen et aL, 2012), CXCL10 dra- 
matically increases with CRC progression (Toiyama et aL, 2012) 
and IL10 secreted by CRC cells can accelerate tumor prolifer- 
ation and be used for the prognosis of CRC progression 
(Toiyama et aL, 2010). 

4.2.3 Pancreatic ductal adenocarcinoma This cancer dataset in- 
cludes expression profiles from 39 PDAC and 39 normal subjects 
(Badea et aL, 2008). By mapping 2781 genes from 54677 probes 
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Fig. 8. The predictive performance of NaNOS when the pathway struc- 
tures are inaccurate. When more edges are randomly selected and 
removed from each pathway, the performance of NaNOS degrades 
smoothly, but still better than the competing methods 

into KEGG pathways, we obtained 67 pathways. Our goal was 
to predict whether a subject has the pancreatic cancer and select 
relevant pathways and genes. We randomly split the dataset into 
39 training and 39 test samples 50 times and ran all the methods 
on each partition. The test performance is visualized in Figure 6c. 
Based on a two-sample £-test, NaNOS significantly outperforms 
lasso, elastic net and group lasso. 

To investigate the sensitivity of NaNOS to the structural noise 
in the pathway database, we randomly chose 20, 50, 80 and 
100% edges in each pathway and removed them. We tested 
NaNOS for each case and reported the average test error rate 
in the new Figure 8. As expected, the error rate of NaNOS grad- 
ually increases with more edges being removed because less topo- 
logical information in pathways is available. But NaNOS still 
consistently outperformed all the alternative methods such as 
elastic net, the second best method on this dataset. This experi- 
ment demonstrates (i) that by exploiting subtle correlation infor- 
mation embedded in the pathway topology, NaNOS can boost 
its modeling power and predictive performance, and (ii) that 
NaNOS is robust to small perturbation in pathway topology. 

We also examined the impact of the important prior distribu- 
tions on pathway and gene selection probabilities Uk and vy. As 
described in Section 2, we used the uniform priors [i.e. the 
Beta(l,l) prior] over Uk and v#, indicating no prior preference 
over selecting a pathway or gene or not. The average test error 
based on the uninformative priors is 9.15 ± 0.5, as visualized in 
Figure 6c. If we change the prior to an informative one, 
Beta(l,10) (mean 0.09 and standard deviation 0.083) that 
strongly prefers sparsity, then the average test error increases 
slightly to 10.0 ± 0.4. This minor increase in error may stem 
from the oversparification caused by the sparsity prior that are 
overconfident (suggested by a small variance). Now if we use 
another informative prior Beta(10,l) (mean 0.91 and standard 
deviation 0.083) that strongly prefers dense — instead of 
sparse — estimation, then the average test error increases to 
11.2 ±0.5. This relatively larger error increase is exactly what 
we expected because now the wrong dense prior aims to select 
most pathways and genes. What is important is that, no matter 
which of these two informative priors we chose, NaNOS consist- 
ently outperformed lasso and group lasso in Figure 6c. Between 
these two extreme cases, if we use an uninformative or weak 
sparse prior [e.g. Beta(0.5,0.5)], we find that similar prediction 
error rates were obtained for NaNOS as in Figure 6c. The above 
analysis indicates that NaNOS is robust to the prior choice. 



In addition to using the even splitting strategy with the same 
number of training and test samples, we also tested the perform- 
ance of all the algorithms in another setting with more training 
samples — specifically, 62 training and 16 test samples. We 
repeated the random partitioning 50 times. The average error 
rates for NaNOS, elastic net, lasso and group lasso are 
8.00±0.89, 9.90 ±1.00, 12.0 ±1.0 and 11.0 ± 0.14, respect- 
ively. Again, the two-sample Mest indicates that NaNOS outper- 
forms the alternative methods significantly (P<0.05). 

Three out of the five pathways with the selection posterior 
probabilities larger than 0.95 are discussed below. They were 
selected 35, 50 and 50 times in the 50 splits. 

The first selected pathway was the transforming growth factor 
beta (TGF-/3) signaling pathway. It is essential in epithelial-mes- 
enchymal transition (EMT) — a critical component for develop- 
mental and cancer processes — and related to PDAC (Krantz 
et al, 2012). The selected part of this pathway is visualized in 
Figure 7c. It shows that IFNG, TNF-a, LTBP1, DCN, TGF-£ 
and its receptor TGF-/3 Rl were selected. The TGF-/3 ligand — 
via its receptor — propagates the signal through phosphorylation 
of Smads including the selected Smad 4, which in turn translocate 
into the nucleus and interact with Snail TFs to regulate EMT 
(Krantz et al, 2012). The selected BMP ligand (i.e. BMP2) is 
bound to BMP Rl and R2 receptors to activate Smadl, which 
is in a protein complex including Smad4. Gordon et al (2009) 
showed that in PANC-1 cell line, this protein complex mediates 
EMT partially by increasing the activity of MMP-2. 

The second identified pathway was ECM-receptor interaction. 
It is associated with desmoplastic reaction, a hallmark in PDAC 
(Shields et al, 2012). In this pathway, NaNOS selected the in- 
tegrin receptors— including ITGB1, ITGA2, ITGA3, ITGA5, 
ITGA6 — and the ECM proteins — collagens including COL1A1 
and COL1A2, and laminins including LAMC2 and LAMB3. 
Important interactions among them were revealed in a previous 
study by Weinel et al (1992). 

The third chosen pathway was CAMs. CAMs are pivotal in 
pancreatic cancer invasion by mediating cell-cell signal transduc- 
tion and cell-matrix communication (Keleg et al, 2003). In this 
pathway, the selected molecules include calcium-dependent cad- 
herin family molecules (CDH2, CDH3) and neural-related mol- 
ecules (MAG); both of them have shown to be related to PDAC 
(Kameda et al, 1999; Keleg et al, 2003). 

5 DISCUSSION 

As shown in the previous section, the new Bayesian approach, 
NaNOS, outperformed the alternative sparse learning methods 
on both simulation and real data by a large margin. Now we 
discuss three factors that may contribute to the improved per- 
formance of NaNOS. 

First, the spike and slab prior (3) and its generalization (4) in 
NaNOS separate weight regularization from the selection of vari- 
ables (pathways or genes). Both the (generalized) spike and slab 
prior and elastic net can be viewed as mixture models in which 
one component encourages the selection of variables and the 
other helps remove irrelevant ones. However, unlike the elastic 
net where the weights over l\ and I2 penalty functions are fixed, 
the spike and slab prior has the selection indicators over these 
two components estimated from data. When a variable is 
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selected, the model has a Gaussian prior over its value (i.e. 
weight) that is equivalent to a h regularizer (as in ridge regres- 
sion) and does not shrink the value of the selected variable as l\ 
penalty would do. By contrast, lasso or elastic net, with a fixed 
mixture weight, has sparsity penalty over both pruned and se- 
lected variables, which can greatly shrink the values of selected 
variables and hurt predictive performance. 

Second, NaNOS incorporates correlation structures encoded 
in pathways for variable selection. Specifically, it uses pathway 
structures into the extended spike and slab prior distribution to 
explicitly model the detailed relationships between correlated 
genes. In contrast, lasso and elastic net do not use this valuable 
correlation information in their models. By comparing prediction 
accuracies of NaNOS when 0 and 100% edges are removed from 
pathways (Fig. 8), we can see that the detailed correlation infor- 
mation captured by the pathway topology can greatly improve 
modeling quality. 

Third, NaNOS has the capability of selecting both relevant 
pathways and genes due to its two-layer sparse structure. By 
contrast, with l\/l 2 penalty, group lasso encourages the selection 
of all the genes in chosen pathways, leading to dense estimation. 
This may be undesirable in practice and deteriorate the predictive 
performance of group lasso. NaNOS enhances the flexibility of 
group lasso by conducting sparse estimation at both the pathway 
(or group) and gene levels. Meanwhile, our Bayesian estimation 
effectively avoids overfitting, a problem often plaguing flexible 
models. 

NaNOS has been applied to joint pathway and gene selection 
in this article. Inspired by the seminal works in (Chuang et ai, 
2007; Frohlich et al., 2006; Srivastava et ai, 2008; Zycinski et ai, 
2013), we can use NaNOS in a variety of biomedical applications 
where there are abundant high-dimensional biomarkers of indi- 
vidual samples and other information sources — for example, the 
gene ontology (GO) and protein-protein interaction networks 
information — that capture correlation in the high-dimensional 
space. Here we discuss two approaches to apply NaNOS when 
we have only GO or other group information without network 
topology. The first approach is to compute some distance or 
similarity scores between genes based on the GO information 
[e.g. following the approach by Srivastava et al. (2008)] and 
then estimate the network topology based on a network learning 
method, for example, graphical lasso (Friedman et al, 2008). 
With the estimated network topology, we can compute the 
graph Laplacian matrices and apply NaNOS to select genes 
and groups of genes. The second approach is to directly use 
the group membership information in NaNOS by replacing the 
graph Laplacian matrices with identity matrices. This approach 
becomes useful when we even do not have any information avail- 
able to learn the network topology. As shown in Figure 8, even 
when all the edges were removed and we had only group infor- 
mation, NaNOS still outperformed the second best method, elas- 
tic net, in terms of prediction accuracy. 
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